Abstract

In one-dimensional density estimation on i.i.d. observations we suggest an adaptive
cross-validation technique for the selection of a kernel estimator. This estimator is both
asymptotically MISE-efficient with respect to the monotone oracle, and sharp minimax-adaptive
over the whole scale of Sobolev spaces with smoothness index greater than 1/2. The
proof of the central concentration inequality avoids "chaining" and relies on an additive
decomposition of the empirical processes involved.
1 Introduction

For many years, adaptive estimation procedures have attracted considerable interest in
statistics. Such estimates achieve minimax convergence rates while relying on very little prior
knowledge about the properties of the curves to be estimated. Oracle inequalities fit into this
framework, but give much more precise information about the performance of an estimate. They
compare the risk of an adaptive candidate not to the minimax risk, but to the best possible risk.
Oracle inequalities have been proposed for a variety of problems and estimators, Kneip (1994) and
Donoho, Johnstone (1994) presumably being the first papers to state some. More recent
examples are Hall, Kerkyacharian, Picard (1999), Cavalier, Tsybakov (2001), Cay (2003),
Efromovich (2004).

In the following we will consider oracle inequalities in density estimation, a context in which
wavelet estimators have received the main attention. During the 1990s,
authors like Donoho, Johnstone, Hall, Kerkyacharian and Picard developed various estimation
techniques that satisfied more and more refined oracle inequalities. Efromovich also examined
Fourier series estimates. Of course, even the case of data-controlled bandwidth selection,
investigated in the 1980s, can be regarded as a kind of oracle problem. But the only source for
a more general oracle inequality for a kernel density estimator is Rigollet (2004). Remarkably,
unlike the other oracle inequalities on density estimators, Rigollet's is an exact one. Our
contribution gives another exact oracle inequality for kernel density estimation, but neither of
the two results covers the other.

Rigollet's application of Stein's blockwise estimator to non-parametric density estimation
is a sharp minimax-adaptive kernel selection rule. The procedure approximates the so-called
monotone oracle by the use of kernel functions with piecewise constant Fourier transform.
The monotone oracle is a pseudo-estimator, which minimizes the quadratic risk (MISE) over
the class of all kernel functions whose Fourier transform is real, symmetric and monotonically
decreasing on $\mathbb{R}_+$.

When considering the concept of curve smoothing from the viewpoint of signal recognition,
a monotone Fourier transform appears to be a natural assumption on a kernel. Given
that the unknown density is square-integrable, it is equivalent with respect to MISE either
to estimate the density itself or to reconstruct it from an estimate of its Fourier transform.
On the other hand it is known that with increasing frequency, random influences outweigh
the true value in the empirical Fourier transform. For this reason, the Fourier series
projection estimator omits empirical Fourier coefficients beyond a critical frequency. In the
famous Pinsker filter, the rigid cut-off is weakened to a monotone shrinkage of the unreliable
coefficients by the Pinsker weights. The focus on kernels with monotonically decreasing,
but otherwise arbitrary Fourier transform is just a further generalization of this notion.
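
The two filters mentioned in this paragraph can be compared numerically. The following is only a minimal sketch: the function names, the critical frequency 4.0 and the exponent 2.0 are illustrative assumptions, and the shrinkage is written in the generic Pinsker-type form $(1 - |\omega/c|^\beta)_+$ rather than with any optimally tuned constants. It checks that both weight functions are non-increasing on the half-line, which is the only property required of the kernel class considered here.

```python
import numpy as np

def cutoff_weights(omega, critical):
    """Projection-type filter: keep frequencies up to `critical`, drop the rest."""
    return (np.abs(omega) <= critical).astype(float)

def pinsker_type_weights(omega, critical, beta):
    """Pinsker-type shrinkage (1 - |omega/critical|^beta)_+ (illustrative parametrisation)."""
    return np.clip(1.0 - np.abs(omega / critical) ** beta, 0.0, None)

def is_nonincreasing(weights):
    """True if the sampled weights never increase along the sorted frequency grid."""
    return bool(np.all(np.diff(weights) <= 1e-12))

# Frequencies on the half-line [0, 10]; both filters are symmetric in omega.
omega = np.linspace(0.0, 10.0, 1001)
hard = cutoff_weights(omega, critical=4.0)
soft = pinsker_type_weights(omega, critical=4.0, beta=2.0)
```

Both `hard` and `soft` are admissible shapes for the Fourier transform of a kernel in the monotone class; the class itself places no further restriction on how the weights decay.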

The objective of the present work is to propose a purely data-dependent estimator that
approximates the monotone oracle in an exact oracle inequality. In comparison to Rigollet
(2004), we abandon the assumption of the kernel's piecewise constant Fourier transform,
our kernels only being band-limited to $[-n, n]$ and having a monotone Fourier transform.
Asymptotic exact MISE-efficiency is shown to hold over the set of all bounded, $L_2$-integrable
densities which are not infinitely differentiable. Sharp asymptotic minimax-adaptivity on the
whole scale of Sobolev spaces with smoothness index greater than 1/2 follows automatically.

There are essentially two quantities to determine the statement of an oracle inequality: the 
set of estimators disposable to the minimization of risk; and the set of true parameters, over 
which the oracle inequality is supposed to hold. Evidently, the larger these sets, the stronger 
the oracle inequality. In a non-parametric setting, regularity conditions are the natural way 
to specify the space of parameters. Classes of estimators considered have been quite diverse 
and cover many familiar non-parametric estimation methods. However, all the classes, for 
which oracle inequalities were proven so far, share an important property - whether fixed or 
growing with the number of observations, their dimension is finite. 

This is natural when dealing with wavelet or Fourier coefficients. In ordered linear
smoothers and blockwise Stein's method, the assumptions "ordered" and "blockwise",
respectively, assure the finite dimensionality. For penalized least squares estimators, the
dimension is always explicitly determined.

Note that oracle inequalities rely on special concentration inequalities, because it is nec- 
essary to approximate the maximum of an empirical process indexed by a class of functions; 
functional limit theorems are imperative. With finite dimension we have access to the uni- 
form entropy of the estimator class and chaining arguments provide us with a suitable bound 
for the process. Yet these approximations unavoidably contain the factor dimension in one 
way or another. 

The class of estimators indexed by kernels with monotone Fourier transform is obviously
not finite-dimensional. But it is known that the set of monotone functions also allows for
an approximation of its uniform covering number. Unfortunately, the approximations do not
carry over from the Fourier to the space domain, and so the chaining approach is not
available to us.

Instead, we pursue an alternative way to approximate our empirical process, namely an
additive decomposition. The process, indexed by the class of kernels, is decomposed into
a linear combination of countably many basis processes. Separate arguments regarding
the basis processes (exponential inequalities) and the size of the non-random coefficients are
combined. The resulting threshold is equivalent to those in finite dimensional model classes,
except that the factor containing the dimension is replaced by $\ln(n)$.

For an outline of the exact procedure, see section 4 and appendices A.1 and A.2. The theorem
along with the hypotheses is formulated in section 2. Section 3 contains the proof of the
theorem, relying on the proposition that the empirical process can be bounded to an appropriate
magnitude. Some practical considerations will be found in section 5.



2 Main results 

Let $(X_1, X_2, \dots, X_n) \in \mathbb{R}^n$ be an i.i.d. sample with common density function $f$. Let the
density $f$ be bounded, $\|f\|_\infty < \infty$, have finite $L_2$-norm, $\|f\|_2 < \infty$, and denote by
$\tilde f(\omega) = \int f(x) e^{ix\omega}\,dx$ the characteristic function of $f$. Let $\hat f_K(x)$ be the standard kernel
estimator with kernel $K$

\[ \hat f_K(x) = \frac{1}{n} \sum_{i=1}^{n} K(X_i - x) \tag{1} \]

and consider the quadratic risk

\[ MISE(K) = E \int \bigl(\hat f_K(x) - f(x)\bigr)^2\,dx. \tag{2} \]

The cross-validation criterion

\[ CV(K) := \int \hat f_K^2(x)\,dx - \frac{2}{n(n-1)} \sum_{i \neq j} K(X_i - X_j) \tag{3} \]

is an unbiased estimator for MISE up to the summand $\|f\|_2^2$. Let $\mathcal{K}$ be the set of all
$L_2$-integrable kernel functions with real, symmetric, non-negative and unimodal Fourier
transform $\tilde K(\omega) := \int K(x) e^{i\omega x}\,dx$. For technical reasons let $\|K\|_2 < \sqrt{n}$. This does not represent
a real constraint, since the MISE of a sequence of kernels with $L_2$-norm growing faster than
$n$ cannot approach 0.
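
To make the estimator (1) and the criterion (3) concrete, here is a minimal numerical sketch for one fixed kernel. The Gaussian kernel with bandwidth `h` is an illustrative assumption: its Fourier transform is real, symmetric, non-negative and unimodal, but it is not band-limited, so it only illustrates the formulas and not the selection rule of this paper. All function names are hypothetical. The integral of the squared estimate is evaluated in closed form, using that the convolution of two centred Gaussians with scale `h` is a centred Gaussian with scale `h * sqrt(2)`.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h (illustrative choice of K)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kde(x, sample, h):
    """Kernel estimator (1): average of K(X_i - x) over the sample, for each x."""
    return gaussian_kernel(sample[None, :] - x[:, None], h).mean(axis=1)

def integrated_squared_estimate(sample, h):
    """Closed form of the integral of the squared estimate: n^-2 * sum_{i,j} (K*K)(X_i - X_j)."""
    n = sample.size
    diffs = sample[:, None] - sample[None, :]
    return gaussian_kernel(diffs, h * np.sqrt(2.0)).sum() / n ** 2

def cv_criterion(sample, h):
    """Cross-validation criterion (3) for the Gaussian kernel with bandwidth h."""
    n = sample.size
    diffs = sample[:, None] - sample[None, :]
    pairwise = gaussian_kernel(diffs, h)
    np.fill_diagonal(pairwise, 0.0)  # the sum in (3) runs over i != j only
    return integrated_squared_estimate(sample, h) - 2.0 * pairwise.sum() / (n * (n - 1))
```

Evaluating `cv_criterion` over a grid of bandwidths and taking the minimizer would be the classical bandwidth selection by cross-validation; the procedure studied here minimizes the same criterion over a much larger class of kernels.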

Define $K^*$ to be the MISE-optimal kernel function for $f$ and $n$ among the class $\mathcal{K}$, i.e.
the monotone oracle, and let $K_0$ be the CV-optimal kernel function among $\mathcal{K}$, restricted to
kernels whose Fourier transform additionally has support in $[-n, n]$:

\[ K^* := \operatorname{argmin} \{ MISE(K) \mid K \in \mathcal{K} \} \]

\[ K_0 := \operatorname{argmin} \{ CV(K) \mid K \in \mathcal{K},\ \operatorname{supp} \tilde K \subset [-n, n] \} \tag{4} \]

Theorem Under the aforementioned hypotheses, for all $\delta > 0$ the following exact oracle
inequality holds:

\[ |E[ISE(K_0)] - MISE(K^*)| = O(n^{-\delta})\, MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n) \]

Remark 1 Although the theorem is stated for a fixed density, we could of course let $f$ vary
in some appropriate set. Investigating the influence of $f$ on the asserted oracle inequality,
we find that both residuals $O(n^{-\delta})$ and $O(n^{\delta-1} \ln^{5/2} n)$ contain constants depending on $f$:
namely $\|f\|_2$ and $\max f$. Obviously, these are uniformly bounded within Sobolev classes $S_\beta(L)$
with smoothness index $\beta > 1/2$ ($f \in S_\beta(L) \iff f \in L_2$ and $\frac{1}{2\pi} \int |\omega^\beta \tilde f(\omega)|^2\,d\omega \le L$). We will
explicitly indicate those steps in the proofs where the dependence enters our approximations.

Corollary $\hat f_{K_0}$ is asymptotically sharp minimax-adaptive on the whole scale of Sobolev
classes with smoothness index greater than 1/2.

Remark 2 In case the true density $f$ is not infinitely smooth, the assertion of the
theorem is equivalent to a general MISE-efficiency, defined analogously to Hall (1983) and Stone
(1984):

\[ \frac{E[ISE(K_0)]}{MISE(K^*)} \to 1 \quad (n \to \infty). \]



3 Proofs 



Proof of the Theorem First of all, it can be seen that for $L_2$-integrable $f$ the difference
between $MISE(K^*)$ and the MISE of a truncated version of $K^*$ is negligible in proportion
to $MISE(K^*)$. So the minimization of MISE on $\mathcal{K}$ is equivalent to that on

\[ \mathcal{K}_n := \{ K \in \mathcal{K} \mid \operatorname{supp} \tilde K \subset [-n, n] \}. \]



Next let us assume the following propositions, the validity of which will be shown in section
4 by wavelet decomposition of the empirical processes: For any $\lambda < \infty$, there exists a set
$A_n \subset \mathbb{R}^n$, such that for an arbitrary observation $X = (X_1, \dots, X_n) \in A_n$ and for $\delta > 0$ it
holds that:

A1 $|ISE(K) - \overline{CV}(K)| = O(n^{-\delta})\, MISE(K) + O(n^{\delta-1} \ln^{5/2} n)$
A2 $|ISE(K) - MISE(K)| = O(n^{-\delta})\, MISE(K) + O(n^{\delta-1} \ln^{5/2} n)$
A3 $P(X \in A_n^c) = O(n^{-\lambda})$

where $O(n^{-\delta})$ and $O(n^{\delta-1} \ln^{5/2} n)$ do not depend on $K$. In case $f$ is a density function that
can only be estimated at a rate $n^{\varepsilon-1}$, $n^{-\delta} MISE(K)$ will dominate $n^{\delta-1}$ for small enough
$\delta > 0$ on the right-hand side of these equations. Otherwise, if either $\delta$ is too big or if $f$ can
be estimated at a faster rate, the term $n^{\delta-1} \ln^{5/2} n$ will dominate.

$\overline{CV}$ is a criterion derived from $CV$, such that $\overline{CV}(K) - \overline{CV}(K') = CV(K) - CV(K')$
for any $K, K'$ in $\mathcal{K}$, and will be defined below. In addition, it holds that $ISE(K) \le
(\|K\|_2 + \|f\|_2)^2 \le (n^{1/2} + \|f\|_2)^2$. As a consequence, we can proceed in the following way:

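\[
\begin{aligned}
E[ISE(K_0)] - MISE(K^*)
&= E\bigl[ISE(K_0) - ISE(K^*)\bigr] \\
&\le E_{A_n}\bigl[ISE(K_0) - ISE(K^*)\bigr] + P(A_n^c) \sup_K ISE(K) \\
&= E_{A_n}\bigl[ISE(K_0) - \overline{CV}(K_0) + \overline{CV}(K_0) - \overline{CV}(K^*) + \overline{CV}(K^*) - ISE(K^*)\bigr] \\
&\qquad + O(n^{-\lambda}) (\sqrt{n} + \|f\|_2)^2 && \text{(A3)} \\
&\le E_{A_n}\bigl|ISE(K_0) - \overline{CV}(K_0)\bigr| + 0 + E_{A_n}\bigl|\overline{CV}(K^*) - ISE(K^*)\bigr| + O(n^{-\lambda+1}) \\
&= O(n^{-\delta}) E_{A_n}[MISE(K_0)] + O(n^{-\delta}) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n) + O(n^{-\lambda+1}) && \text{(A1)} \\
&= O(n^{-\delta}) E_{A_n}[ISE(K_0)] + O(n^{-\delta}) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n) && \text{(A2)}
\end{aligned}
\]

for $\lambda$ sufficiently large. The middle term is bounded by 0 because $K_0$ minimizes $CV$, and hence
$\overline{CV}$. In order to return to $E[ISE(K_0)]$, we apply proposition A3 again so
as to find $|E_{A_n}[ISE(K_0)] - E[ISE(K_0)]| = O(n^{-\lambda+1})$, and therewith

\[
\begin{aligned}
E[ISE(K_0)] - MISE(K^*) &= O(n^{-\delta}) E[ISE(K_0)] + O(n^{-\delta}) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n) \\
\Rightarrow \quad E[ISE(K_0)] &= \bigl(1 + O(n^{-\delta})\bigr) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n)
\end{aligned}
\]

By A2, the opposite is also readily shown:

\[
\begin{aligned}
MISE(K^*) - E[ISE(K_0)] &= O(n^{-\delta}) E[ISE(K_0)] + O(n^{\delta-1} \ln^{5/2} n) \\
\Rightarrow \quad MISE(K^*) &= \bigl(1 + O(n^{-\delta})\bigr) E[ISE(K_0)] + O(n^{\delta-1} \ln^{5/2} n)
\end{aligned}
\]

which implies the desired result:

\[ |E[ISE(K_0)] - MISE(K^*)| = O(n^{-\delta}) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n) \qquad \Box \]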

Proof of the Corollary The minimax risk of density estimation in Sobolev classes $S_\beta(L) =
\{ f \in L_2 \mid \|f^{(\beta)}\|_2^2 \le L \}$, where $\beta \in \mathbb{N}_+$ and $L < \infty$, is known since Efroimovich, Pinsker (1983).
It is also known that kernel estimators employing suitable kernels attain the minimax risk.
One of these so-called minimax kernels is $K_\beta$ with

K $ {x) = f g { ^ X xl+1 and k,(u) = (l - \uf 

Obviously, the Fourier transform of any $K_\beta$, $\beta \in \mathbb{N}_+$, is unimodal, so it is contained in $\mathcal{K}$. That
means, the MISE-optimal estimator (monotone oracle) $\hat f_{K^*}$ cannot be worse than the minimax
estimator $\hat f_{K_\beta}$. On the other hand, the CV-optimal estimator $\hat f_{K_0}$ is asymptotically as good
as $\hat f_{K^*}$, where the convergence is uniform on Sobolev classes with $\beta > 1/2$, as emphasized in
Remark 1 in section 2. It follows that $\hat f_{K_0}$ is asymptotically minimax simultaneously on the
scale of Sobolev classes $S_\beta(L)$ with $\beta \in \mathbb{N}_+$.

The consideration of the minimax risk of density estimators can be extended to Sobolev
type classes with non-integer smoothness index $\beta \in \mathbb{R}_+$, defined as

\[ S_\beta(L) := \Bigl\{ f \in L_2 \Bigm| \frac{1}{2\pi} \int |\omega^\beta \tilde f(\omega)|^2\,d\omega \le L \Bigr\}. \]

For $\beta > 1/2$, both the minimax risk and the minimax kernel $K_\beta$ take forms analogous to those
in ordinary Sobolev classes, although the proofs have to be adjusted (see Dalelane (2005)).
The same idea as before leads to simultaneous asymptotic minimaxity of $\hat f_{K_0}$ on the whole
scale of Sobolev type classes $S_\beta(L)$ with $\beta \in \mathbb{R}_+$, $\beta > 1/2$. $\Box$



4 The empirical process 

As the proof of proposition A2 is very much the same as the one for A1, we confine ourselves
to a demonstration of how $|ISE(K) - \overline{CV}(K)|$ can be approximated by $O(n^{-\delta}) MISE(K) +
O(n^{\delta-1} \ln^{5/2} n)$ simultaneously over $\mathcal{K}_n$. The first step towards this goal will be to split up the
difference between ISE and $\overline{CV}$ into two empirical U-processes indexed by $\mathcal{K}_n$, a degenerate
U-process of order 2 and a U-process of order 1, i.e. a partial sum process. This splitting
was already observed in Stone (1984), where the class of kernels consists of just one rescaled
kernel function: $\mathcal{K}_K = \{ K_h \mid h > 0 \}$. Under some assumptions on $K$, it is easy to bound
the uniform covering number of $\mathcal{K}_K$, see Nolan, Pollard (1987). Chaining arguments apply
to both the partial sum process and the empirical U-process. But for lack of an appropriate
approximation on $\mathcal{K}_n$, a generalization of Nolan and Pollard's proof is not possible.

Instead, we define a wavelet-inspired function basis for $\mathcal{K}_n$, such that every kernel $K \in \mathcal{K}_n$
can be represented as a linear combination of the functions belonging to this basis. The linear
decomposition is carried forward to the space of U-statistics made up by $\mathcal{K}_n$, such that each
U-statistic in $\mathcal{K}_n$ is a weighted sum of all (countably many) U-statistics of the function basis.
The values of the basic U-statistics can be controlled by means of exponential inequalities.
On a set of "favorable events" with overwhelming probability (proposition A3), they do
not exceed a comfortable threshold of $n^{-1} \lambda \ln^{3/2} n$. In turn, due to the unimodality of the
kernels' Fourier transforms, we can bound the absolute sum of the (non-random) wavelet
coefficients, which assign a linear combination of basic U-statistics to a given U-statistic in $\mathcal{K}_n$,
by $\ln n\, \|K\|_2$. Combining these arguments, we find that any U-statistic in $\mathcal{K}_n$ is an
$O(n^{-1} \ln^{5/2} n) \|K\|_2 = O(n^{-1/2} \ln^{5/2} n) \sqrt{MISE(K)}$, the $O$'s depending neither on $K$ nor on
$f$.

To derive the desired bound of $O(n^{-\delta}) MISE(K^*) + O(n^{\delta-1} \ln^{5/2} n)$ therefrom, we have
to distinguish several constellations of the true density $f$ and the envisaged $\delta$. Recall
that the monotone oracle kernel $K^*$ is not random and depends on nothing but $f$ and $n$.

First consider $f$ such that there exist constants $0 < l_f, u_f < \infty$ and $\varepsilon_f > 0$, which satisfy
$l_f\, n^{\varepsilon_f - 1} \le MISE(K^*) \le u_f\, n^{\varepsilon_f - 1}$. If $\delta < \varepsilon_f / 2$, then we have immediately

\[ O(n^{-1/2} \ln^{5/2} n) \sqrt{MISE(K^*)} \le O(n^{-\delta})\, MISE(K^*). \]

If otherwise $\delta \ge \varepsilon_f / 2$ holds, it follows that

\[ O(n^{-1/2} \ln^{5/2} n) \sqrt{MISE(K^*)} \le O(n^{\delta-1} \ln^{5/2} n). \]

This second reasoning is also true when the convergence rate of $MISE(K^*)$ is of smaller order than
$n^{\varepsilon-1}$ for any $\varepsilon > 0$, i.e. if the density $f$ has infinitely many derivatives.

By a similar procedure but employing a different function basis, we also approximate the
partial sum process. But this is already proposition A1.

To be exact, let $X_1, \dots, X_n$ be distributed as assumed in section 2. Let $X$ and $Y$ denote two
further random variables with the same distribution, independent of $X_1, \dots, X_n$ and of each
other.

\[ ISE(K) := \int \bigl(\hat f_K(x) - f(x)\bigr)^2 dx = \int \hat f_K^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} E[K(X_i - X) \mid X_i] + E[f(X)] \]

\[ CV(K) = \int \hat f_K^2(x)\,dx - \frac{2}{n(n-1)} \sum_{i \neq j} K(X_i - X_j) \]

We obtain $\overline{CV}$ from $CV$ by adding a zero and a further term which does not depend on $K$.
Define $I_n(\omega) := I(|\omega| \le n)$ and

\[ h_f(x) := \frac{1}{2\pi} \int \tilde f(\omega) \bigl(1 - I_n(\omega)\bigr) e^{-i\omega x}\,d\omega \tag{5} \]

the high-frequency contribution of $\tilde f$ to $f$.

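\[
\begin{aligned}
\overline{CV}(K) := CV(K) &+ \Bigl[ \frac{2}{n} \sum_{j=1}^{n} E[K(X - X_j) \mid X_j] - 2 E[K(X - Y)] - \frac{2}{n} \sum_{j=1}^{n} E[K(X - X_j) \mid X_j] + 2 E[K(X - Y)] \Bigr] \\
&+ \frac{2}{n} \sum_{j=1}^{n} \bigl( f(X_j) - h_f(X_j) \bigr) - 2 E[f(X_j) - h_f(X_j)]
\end{aligned}
\]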

We can now split up the difference between the quadratic loss and the cross-validation crite- 
rion into two summands: 

ISE(K) - CV(K) 

= -\T. E - Y ^ + E + ^hi) E K ( Xi - x i) 

i=l { ' i+3 

n n 

- Ys E ^ X ~ X i)\ X A + 2E ^ X ~ y )l + - E E ^ X ~ X i)\ X A 



n 



3=1 



3=1 



2 

2E [K(X - Y)\ - - " M*i)) + 2E[f(Xj) - h f {X 3 ) 



3=1 



n(n bl) E( K ^ " X i) ~ E l K ( X * ~ X ^ X i\ ~ E i R ( X i ~ X 3)\ X 3 
i+3 



2 

+ E [K{X t - X,)}) + - J2(E [K(X - Xj)\Xj\ - fiX,) + h f (Xj 

3=1 

- E [K(X - Xj)} - E[f(Xj) - hfiXj)] ) 



1 ' i+3 3=1 

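where $b_K$ stands for the bias $E \hat f_K - f$. The first term corresponds to a degenerate U-statistic,
since $E[U_K(X,Y) \mid Y] = E[U_K(X,Y) \mid X] = E[U_K(X,Y)] = 0$ for all values of $X$ and $Y$. In
appendix A.1, we will define a basis of father and mother wavelets for $\mathcal{K}_n$, which allows the
following decomposition:

\[ K(x) = \sum_{t} \alpha_t(K)\, \varphi_t(x) + \sum_{s,t} \beta_{st}(K)\, \psi_{st}(x) \]

This decomposition can also be assigned to the U-statistics, such that

\[ \frac{1}{n(n-1)} \sum_{i \neq j} U_K(X_i, X_j) = \frac{1}{n(n-1)} \sum_{i \neq j} \Bigl[ \sum_{t} \alpha_t(K)\, U_{\varphi_t}(X_i, X_j) + \sum_{s,t} \beta_{st}(K)\, U_{\psi_{st}}(X_i, X_j) \Bigr] \]

A change of summation separates the stochastic processes from the deterministic coefficients.

\[ \frac{1}{n(n-1)} \sum_{i \neq j} U_K(X_i, X_j) = \sum_{t} \alpha_t(K)\, \frac{1}{n(n-1)} \sum_{i \neq j} U_{\varphi_t}(X_i, X_j) + \sum_{s,t} \beta_{st}(K)\, \frac{1}{n(n-1)} \sum_{i \neq j} U_{\psi_{st}}(X_i, X_j) \]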



The basic U-statistics can be kept "small" on a set of "favorable events" $A_{n1} \subset \mathbb{R}^n$ (see
appendix A.1), and in Lemma 1 we find bounds for the wavelet coefficients, so that on $A_{n1}$
the following holds



(7) 



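and for sufficiently large $\lambda < \infty$, equation (10) in Lemma 2 shows that

\[ P(A_{n1}^c) = O(n^{-\lambda^{2/3}+1}). \]

On the other hand, we obtain in (6) a partial sum in $b_K + h_f$. Because of the bounded
support of $\tilde K$, $b_K + h_f$ takes the form:

\[
\begin{aligned}
b_K(x) + h_f(x) &= f * K(x) - f(x) + h_f(x) \\
&= \frac{1}{2\pi} \int \tilde f(\omega) \bigl(\tilde K(\omega) - 1\bigr) e^{-i\omega x}\,d\omega + \frac{1}{2\pi} \int \tilde f(\omega) \bigl(1 - I_n(\omega)\bigr) e^{-i\omega x}\,d\omega \\
&= \frac{1}{2\pi} \int \tilde f(\omega) \bigl(\tilde K(\omega) - 1\bigr) I_n(\omega)\, e^{-i\omega x}\,d\omega
\end{aligned}
\]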



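It is the low-frequency component of the bias and exactly that part which really depends on
the kernel. In appendix A.2, the partial sum is bounded on another set of "favorable events"
$A_{n2} \subset \mathbb{R}^n$:

\[ \Bigl| \frac{1}{n} \sum_{j=1}^{n} \bigl( b_K(X_j) + h_f(X_j) \bigr) - E[b_K(X_j) + h_f(X_j)] \Bigr| = O\Bigl( \frac{\ln^2 n}{\sqrt{n}} \Bigl( \sqrt{\int b_K^2(x)\,dx} + \frac{1}{\sqrt{n}} \Bigr) \Bigr) \tag{8} \]

and in equation (12) of Lemma 4 we see that

\[ P(A_{n2}^c) = O(n^{-\lambda+1}). \]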



The intersection of these two sets of "favorable events", $A_{n1} \cap A_{n2} =: A_n$, is the one used
in section 3 to bound $\overline{CV} - ISE$ (on the very same set $A_n$, $ISE - MISE$ can be bounded to an
identical size).

The threshold for the U-statistic is of order $n^{-1/2} \ln^{5/2} n\, (\sqrt{MISE(K)} + n^{-1/2})$. The
one for the bias is of order $n^{-1/2} \ln^{2} n\, (\sqrt{MISE(K)} + n^{-1/2})$, but depends on $\|f\|_2$. When
$\|f\|_2$ is uniformly bounded, as in Sobolev classes with smoothness index greater than 1/2,
this approximation is uniform as well. Besides, MISE converges in any case not faster than $n^{-1}$.
Hence:



< 2 



\ISE(K) - CV{K)\ 
1 



n(n — 1) 



Y^Uk^X^ + 2|- + hf(Xj)) -E[b K (Xj) + h f (X 3 

i+3 j =1 



o 



'in 5 / 2 



11 



n 



I J K 2 (x)dx + l) + o(^£\ Uj b\(x)dx + 



n 






= o 



= o 




) 



which concludes the proof of proposition A1 and

\[ P(A_n^c) \le P(A_{n1}^c) + P(A_{n2}^c) = O(n^{-\lambda'}) \quad \text{for an appropriate } \lambda' < \infty, \]

which is proposition A3.

5 Practical computation 

Once the statistical properties of the CV-optimal kernel function $K_0$ have been examined, we
would like to actually compute this kernel from a sample $X_1, \dots, X_n$. $K_0$ is $\operatorname{argmin} CV(K)$
within the set $\mathcal{K}_n := \{ K \in \mathcal{K} \mid \operatorname{supp} \tilde K \subset (-n, n) \}$. Hence we face a minimization problem.

Note that the set $\mathcal{K}$ is convex. With respect to the properties $\tilde K$ real and non-negative and
$\|K\|_2^2 \le n$, convexity is obvious. Given that all $\tilde K$ in $\mathcal{K}$ are unimodal and symmetric around
0, their mode is 0. And a convex combination of any two $\tilde K$ is again unimodal. Convexity
is also preserved through the trimming of the support of $\tilde K$. On the other hand, $CV(K)$ is
a strictly convex function. Therefore $\min CV(K)$ over $\mathcal{K}_n$ is a convex optimization problem,
where the argument is itself a non-increasing function, $\tilde K: [0, n) \to [0, 1]$.

Strictly convex problems have a unique solution, so we are theoretically safe. The question is of
course how to find the solution. Consider a discrete version of $\mathcal{K}_n$, say $\mathcal{K}_n^t$, which contains all real,
symmetric and unimodal piecewise constant functions on $[0, n)$, with jumps at the points
$2^{-t} k$, $k = 1, \dots, 2^t n$, and values in $[0, 1]$. The minimization of $CV(K)$ over $\mathcal{K}_n^t$ is still a convex
optimization problem, but this time with respect to a parameter of dimension $2^t n$ (number
of variables) and with $2^t n + 2$ constraints (unimodality, positivity and $L_2$-norm).
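
One possible way to attack the discretized problem is sketched below; it is an illustration under stated assumptions, not the algorithm of this paper. Writing $CV$ in the Fourier domain by Parseval's identity, with $\phi_n$ the empirical characteristic function, gives $CV(K) = \frac{1}{\pi} \int_0^n [\tilde K^2 |\phi_n|^2 - 2 \tilde K (n |\phi_n|^2 - 1)/(n-1)]\,d\omega$, so for a step function with values $v_k$ the criterion is a weighted least-squares objective in $v$. The sketch then swaps in the pool-adjacent-violators algorithm (weighted isotonic regression) followed by clipping to $[0, 1]$. Assumptions: the $L_2$-norm constraint and the normalization $\tilde K(0) = 1$ are ignored, the cell integrals are approximated by a midpoint rule, and all function names are hypothetical.

```python
import numpy as np

def cv_coefficients(sample, edges, points_per_cell=16):
    """Per-cell integrals a_k of |phi_n|^2 and b_k of (n|phi_n|^2 - 1)/(n - 1) (midpoint rule)."""
    n = sample.size
    a = np.empty(len(edges) - 1)
    b = np.empty(len(edges) - 1)
    for k in range(len(edges) - 1):
        width = edges[k + 1] - edges[k]
        omega = edges[k] + (np.arange(points_per_cell) + 0.5) * width / points_per_cell
        # squared modulus of the empirical characteristic function on the cell
        ecf_sq = np.abs(np.exp(1j * omega[:, None] * sample[None, :]).mean(axis=1)) ** 2
        a[k] = ecf_sq.mean() * width
        b[k] = ((n * ecf_sq - 1.0) / (n - 1.0)).mean() * width
    return a, b

def cv_value(values, a, b):
    """CV of the step function with cell values `values`, up to the quadrature error."""
    return float(np.sum(a * values ** 2 - 2.0 * b * values) / np.pi)

def pava_nonincreasing(targets, weights):
    """Weighted least-squares fit of a non-increasing sequence (pool adjacent violators)."""
    blocks = []  # each block holds [weighted mean, total weight, number of cells]
    for target, weight in zip(targets, weights):
        blocks.append([target, weight, 1])
        # merge neighbouring blocks while the non-increasing order is violated
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(count, mean) for mean, _, count in blocks])

def cv_optimal_step_transform(sample, edges):
    """Non-increasing step function in [0, 1] minimizing the discretized CV objective."""
    a, b = cv_coefficients(sample, edges)
    fitted = pava_nonincreasing(b / a, a)
    return np.clip(fitted, 0.0, 1.0), a, b
```

For a sample of size 40 and $t = 2$, the grid would be `edges = np.linspace(0.0, 40.0, 161)`. Any hard cut-off filter is a feasible point of the discretized problem, which gives a simple sanity check for the returned minimizer.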

The $L_1$-distance between a kernel function $\tilde K$ in $\mathcal{K}_n$ and its closest neighbor $\tilde K^t$ in $\mathcal{K}_n^t$ is
not greater than $2^{-t}$ (and thus the same applies for the supremum distance between $K$ and
$K^t$). It follows with little effort that $|CV(K) - CV(K^t)|$ is at most a constant multiple
$c \cdot 2^{-t}$ of this distance, and therewith

\[ CV(K_0^t) := \min_{K \in \mathcal{K}_n^t} CV(K) \le CV\bigl((K_0)^t\bigr) \le CV(K_0) + c \cdot 2^{-t}. \]

Since $\mathcal{K}_n^t \subset \mathcal{K}_n^{t+1}$, the sequence $\{K_0^t\}_t$ converges towards $K_0$, the unique solution of the
original problem.

There is no doubt that a profound analysis in terms of optimization would yield a more
sophisticated algorithm to solve the problem, possibly avoiding discretization and giving
convergence rates over classes of densities.

A Appendix

A.1 Wavelet decomposition of the kernel. Like the class $\mathcal{K}_n$ itself, the desired basis is
constructed in the Fourier domain. We are searching for a way to compress most economically
the information inherent to $\tilde K$. To this end, we utilize $\tilde K$'s assumed monotonicity on $\mathbb{R}_+$, which
gives that for $\|\tilde K\|_2$ fixed, $\tilde K(\omega) \le \|\tilde K\|_2 |2\omega|^{-1/2}$ must hold. Heuristically speaking, the further
out we reach on the line $\mathbb{R}_+$, the smaller will be the variation in $\tilde K$. But that means we can
allow for a rougher approximation without losing much of our approximating power.

Technically we implement the idea as follows: Inspired by the well-known Haar basis,
symmetric father wavelets are defined on the interval $[-n, n]$: $\tilde\varphi_{01}(\omega) := 2^{-1/2} I(|\omega| \in [0, 1))$,
$\tilde\varphi_{02}(\omega) := 2^{-1/2} I(|\omega| \in [1, 2))$. After that, we let the supports of the wavelets grow: with
negative scale index, we define $\tilde\varphi_{-s,2}(\omega) := 2^{-(s+1)/2} I(|\omega| \in [2^s, 2^{s+1}))$, $1 \le s \le d_n$, where
$d_n \sim \ln n$.

[Figure: father wavelets]



The sequence of father wavelets $(\tilde\varphi_{01}, \tilde\varphi_{02}, \tilde\varphi_{-1,2}, \tilde\varphi_{-2,2}, \dots, \tilde\varphi_{-d_n,2})$ covers the whole
interval $[-n, n]$ (the support of a function being defined as the closure of the set where it is
nonzero) and comprises $d_n + 2$ elements. On the supporting interval of each father wavelet,
the mother wavelets are defined on refining scales. With notation $I_{ut}(\omega) := I(|\omega| \in [2^{-u}(t-1), 2^{-u} t))$,
the mother wavelets on $(-2^{s+1}, -2^s] \cup [2^s, 2^{s+1})$ are $\tilde\psi_{ut}(\omega) := 2^{(u-1)/2} [I_{u+1,2t-1}(\omega)
- I_{u+1,2t}(\omega)]$, $u = -s, -s+1, \dots, 0, 1, 2, \dots$ and $t = 2^{s+u} + 1, \dots, 2^{s+u+1}$. When we combine
all mother wavelets with the same scale index $s$, we arrive at a sequence $(\tilde\psi_{s,2}, \dots, \tilde\psi_{s,2^s n})$
for $s = -1, \dots, -d_n$, and $(\tilde\psi_{s1}, \dots, \tilde\psi_{s,2^s n})$ for $s \ge 0$. We observe that for $s < 0$, the
corresponding mother wavelets do not cover the whole interval $[-n, n]$, but only $[-n, -2^{-s}] \cup [2^{-s}, n]$.
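
The father wavelets described above are simple enough to check numerically. The sketch below is only an illustration with hypothetical names; it assumes that $d_n$ is the smallest integer with $2^{d_n+1} \ge n$ (the text only states $d_n \sim \ln n$) and ignores the truncation of the last block when $n$ is not a power of two. It builds the dyadic blocks, verifies that each father wavelet has unit $L_2$-norm, and computes the father coefficients of a given symmetric transform by a midpoint rule.

```python
import numpy as np

def father_blocks(n):
    """Return (label, lower, upper, height) of the father wavelets on |omega| in [0, n].

    Blocks are [0,1), [1,2) and the dyadic intervals [2^s, 2^(s+1)) for s = 1..d_n,
    with d_n taken as the smallest integer such that 2^(d_n+1) >= n (an assumption).
    """
    blocks = [("phi_01", 0.0, 1.0, 2.0 ** -0.5), ("phi_02", 1.0, 2.0, 2.0 ** -0.5)]
    s = 1
    while 2.0 ** s < n:
        blocks.append((f"phi_-{s},2", 2.0 ** s, 2.0 ** (s + 1), 2.0 ** (-(s + 1) / 2)))
        s += 1
    return blocks

def squared_norm(lower, upper, height):
    """Squared L2-norm of height * I(lower <= |omega| < upper); the set has length 2*(upper-lower)."""
    return height ** 2 * 2.0 * (upper - lower)

def father_coefficients(k_tilde, blocks, grid_points=2000):
    """Midpoint-rule approximation of the integrals of phi(omega) * K~(omega) over the real line."""
    coeffs = []
    for _, lower, upper, height in blocks:
        mid = lower + (np.arange(grid_points) + 0.5) * (upper - lower) / grid_points
        # factor 2 accounts for the negative half-line (phi and K~ are symmetric)
        coeffs.append(2.0 * height * k_tilde(mid).mean() * (upper - lower))
    return np.array(coeffs)
```

For example, `father_coefficients(lambda w: np.clip(1.0 - (w / 10.0) ** 2, 0.0, None), father_blocks(64))` returns seven non-negative coefficients whose sum is bounded, by the Cauchy-Schwarz inequality, by $\sqrt{7}$ times the $L_2$-norm of the transform.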



[Figure: mother wavelets, one panel per scale (panels include scales -1 and -2)]

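Unifying the notation:

\[
\begin{aligned}
I_{st}(\omega) &= I\bigl(|\omega| \in [2^{-s}(t-1), 2^{-s} t)\bigr) \\
\tilde\varphi_{st}(\omega) &= 2^{(s-1)/2} I_{st}(\omega) \\
\tilde\psi_{st}(\omega) &= 2^{(s-1)/2} \bigl[ I_{s+1,2t-1}(\omega) - I_{s+1,2t}(\omega) \bigr],
\end{aligned}
\]

we have the following complete orthonormal function basis of $L_2((-n, n))$: $\{\tilde\varphi_{01}\} \cup \{\tilde\varphi_{s2} \mid s =
0, \dots, -d_n\} \cup \{\tilde\psi_{st} \mid s = -1, \dots, -d_n \text{ and } t = 2, \dots, 2^s n\} \cup \{\tilde\psi_{st} \mid s \ge 0 \text{ and } t = 1, \dots, 2^s n\}$. The
decomposition of $\tilde K$ results in: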



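\[ \tilde K(\omega) = \alpha_{01}(K) \tilde\varphi_{01}(\omega) + \sum_{s=0}^{-d_n} \alpha_{s2}(K) \tilde\varphi_{s2}(\omega) + \sum_{s=-1}^{-d_n} \sum_{t=2}^{2^s n} \beta_{st}(K) \tilde\psi_{st}(\omega) + \sum_{s=0}^{\infty} \sum_{t=1}^{2^s n} \beta_{st}(K) \tilde\psi_{st}(\omega) \]

\[ \alpha_{st}(K) := \int \tilde\varphi_{st}(\omega) \tilde K(\omega)\,d\omega \quad \text{and} \quad \beta_{st}(K) := \int \tilde\psi_{st}(\omega) \tilde K(\omega)\,d\omega \]

($\tilde K$ and the wavelets are both symmetric, so conjugation can be dropped.) By an inverse
Fourier transform, the additive decomposition of $\tilde K$ can be transferred to the space domain.

\[ K(x) = \alpha_{01}(K) \varphi_{01}(x) + \sum_{s=0}^{-d_n} \alpha_{s2}(K) \varphi_{s2}(x) + \sum_{s=-1}^{-d_n} \sum_{t=2}^{2^s n} \beta_{st}(K) \psi_{st}(x) + \sum_{s=0}^{\infty} \sum_{t=1}^{2^s n} \beta_{st}(K) \psi_{st}(x) \]

Accordingly, the summands in the U-process decompose into:

\[
\begin{aligned}
U_K(X_i, X_j) = {} & \alpha_{01}(K) U_{\varphi,01}(X_i, X_j) + \sum_{s=0}^{-d_n} \alpha_{s2}(K) U_{\varphi,s2}(X_i, X_j) + \sum_{s=-1}^{-d_n} \sum_{t=2}^{2^s n} \beta_{st}(K) U_{\psi,st}(X_i, X_j) \\
& + \sum_{s=0}^{\infty} \sum_{t=1}^{2^s n} \beta_{st}(K) U_{\psi,st}(X_i, X_j),
\end{aligned}
\]

where $U_{\varphi,st}(X_i, X_j) := \varphi_{st}(X_i - X_j) - E[\varphi_{st}(X_i - X_j) \mid X_i] - E[\varphi_{st}(X_i - X_j) \mid X_j] + E[\varphi_{st}(X_i - X_j)]$,
and $U_{\psi,st}$ is defined likewise for $\psi_{st}$. Interchanging the order of summation, we obtain that:

\[
\begin{aligned}
\frac{1}{n(n-1)} \sum_{i \neq j} U_K(X_i, X_j) = {} & \alpha_{01}(K) \frac{1}{n(n-1)} \sum_{i \neq j} U_{\varphi,01}(X_i, X_j) + \sum_{s=0}^{-d_n} \alpha_{s2}(K) \frac{1}{n(n-1)} \sum_{i \neq j} U_{\varphi,s2}(X_i, X_j) \\
& + \sum_{s=-1}^{-d_n} \sum_{t=2}^{2^s n} \beta_{st}(K) \frac{1}{n(n-1)} \sum_{i \neq j} U_{\psi,st}(X_i, X_j) \\
& + \sum_{s=0}^{\infty} \sum_{t=1}^{2^s n} \beta_{st}(K) \frac{1}{n(n-1)} \sum_{i \neq j} U_{\psi,st}(X_i, X_j)
\end{aligned}
\tag{9}
\]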



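From this point onwards, the sums of wavelet coefficients and the U-statistics can be handled
separately. The $\alpha$'s and $\beta$'s are deterministic and we show in Lemma 1:

\[
\begin{aligned}
|\alpha_{01}(K)| + \sum_{s=0}^{-d_n} |\alpha_{s2}(K)| &\le \sqrt{d_n + 2}\, \sqrt{2\pi \int K^2(x)\,dx} \\
\sum_{t=2}^{2^s n} |\beta_{st}(K)| &\le \sqrt{2\pi \int K^2(x)\,dx} && \text{for } s < 0 \\
\sum_{t=1}^{2^s n} |\beta_{st}(K)| &\le 2^{(-s+1)/2} && \text{for } s \ge 0
\end{aligned}
\]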



For a suitable constant A < oo, we choose our set of "favorable events" as: 
A nl :=\(X l ,...,X n ): £ U^X^Xj) < (a, t) = (-d n , 2), . . . , (0, 2), (0, 1); 



n(n-l) S U^ at (Xi,Xj) 



n(n-l) Yj U ipst( X i, X j] 



<Ah£^ ; s = -d n ,...,-l,t = 2,...,2'n; 
< Aii f ±£ , s>0, t = l,...,2 s n} 



whereupon the U-statistics do not become excessively large. The fact that the complement
of the set $A_{n1}$ has probability tending to 0 as $n \to \infty$, $P(A_{n1}^c) = O(n^{-\lambda^{2/3}+1})$ (uniformly
for $f \in S_\beta(L)$ with $\beta > 1/2$), will be shown in Lemma 2, equation (10). On $A_{n1}$ it holds that
(in connection with (9)):



, — —y2u K {Xi,Xj 

n(n — 1 



< 



A In 3 / 2 



n 



-d n 2 s n 



s=0 



s=-l t=2 



2 s n 



+£^±i£lA.(*)l 



< 



s=0 

Aln 3 / 2 n 



n 

oo 



= o 



Vdn + 2 y 2vr y K 2 (x)dx + d„y2vr ^ K 2 (x)d 
y> Alnn + s 2 (_ s+ i)/2 



which completes (7). But two assertions are still left to be verified.

Lemma 1 For the father and mother wavelet coefficients of $K$ defined so far, it holds
that

\[
\begin{aligned}
|\alpha_{01}(K)| + \sum_{s=0}^{-d_n} |\alpha_{s2}(K)| &\le \sqrt{d_n + 2}\, \sqrt{2\pi \int K^2(x)\,dx} \\
\sum_{t=2}^{2^s n} |\beta_{st}(K)| &\le \sqrt{2\pi \int K^2(x)\,dx} && \text{for } s < 0 \\
\sum_{t=1}^{2^s n} |\beta_{st}(K)| &\le 2^{(-s+1)/2} && \text{for } s \ge 0
\end{aligned}
\]



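Proof Since the $\tilde\varphi_{st}$ are orthonormal, we can deduce through Cauchy-Schwarz:

\[ |\alpha_{01}(K)| + \sum_{s=0}^{-d_n} |\alpha_{s2}(K)| = \int \tilde K(\omega) \Bigl[ \tilde\varphi_{01}(\omega) + \sum_{s=0}^{-d_n} \tilde\varphi_{s2}(\omega) \Bigr] d\omega \le \sqrt{\int \tilde K^2(\omega)\,d\omega}\, \sqrt{d_n + 2} \]

Let $TV(\tilde K \mid \operatorname{supp} I_{st})$ denote the total variation of $\tilde K$ on the support of $I_{st}$, and likewise for
max and min. It is known that for mother wavelet coefficients, it holds that:

\[ \sum_{t} |\beta_{st}(K)| \le 2^{-(s+1)/2}\, TV\Bigl(\tilde K \Bigm| \bigcup_t \operatorname{supp} I_{st}\Bigr) \]

For $s \ge 0$, the supports of the mother wavelets cover the whole interval $[-n, n]$, and we obtain
$TV(\tilde K \mid \bigcup_t \operatorname{supp} I_{st}) = TV(\tilde K) \le 2$, due to the unimodality of $\tilde K$ ($\tilde K(0) = \int K(x)\,dx = 1$). For
$s < 0$, $\bigcup_t \operatorname{supp} I_{st} = [-n, -2^{-s}] \cup [2^{-s}, n]$, and we use that

\[ \int_0^n \tilde K^2(\omega)\,d\omega = \int_0^{\omega_0} \tilde K^2(\omega)\,d\omega + \int_{\omega_0}^{n} \tilde K^2(\omega)\,d\omega \ge \int_0^{\omega_0} \tilde K^2(\omega_0)\,d\omega = \omega_0 \tilde K^2(\omega_0). \]

This yields, with $\omega_0 = 2^{-s}$,

\[
\begin{aligned}
\sum_{t} |\beta_{st}(K)| &\le 2^{-(s+1)/2}\, TV\Bigl(\tilde K \Bigm| \bigcup_t \operatorname{supp} I_{st}\Bigr) \\
&\le 2^{-(s+1)/2} \cdot 2 \tilde K(2^{-s}) \\
&\le 2^{-(s+1)/2} \cdot 2 \sqrt{2^{s} \int_0^n \tilde K^2(\omega)\,d\omega} = \sqrt{2 \int_0^n \tilde K^2(\omega)\,d\omega} = \sqrt{2\pi \int K^2(x)\,dx}. \qquad \Box
\end{aligned}
\]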



Lemma 2 For the father and mother wavelets defined above and arbitrary < A < oo it 
holds that 

^^^^|E^( X -^)|>^V^j = 0(n- A2/3 ) for ( S ,t) = (-d„,2),...,(0,2) 
>



A In 3 / 2 



n 



n 



and (0, 1) 

0(n~ A2/3 ) for s = -l,...,-d n 
and t = 2, . . . , 2 s n 



> 



A In n + s 



n 



j = o(n- A2/3 e- s ) for s = 0,1,2,... 
and t = 1, . . . , 2 s n 



These bounds $O(\cdot)$ are uniform in $s$ and $t$.

$A_{n1}^c$ is the union of all complementary sets and the approximations of Lemma 2 give

\[
\begin{aligned}
P(A_{n1}^c) &= O(n^{-\lambda^{2/3}}) \Bigl[ (d_n + 2) + \sum_{s=-1}^{-d_n} \sum_{t=2}^{2^s n} 1 \Bigr] + \sum_{s=0}^{\infty} \sum_{t=1}^{2^s n} O(n^{-\lambda^{2/3}} e^{-s}) \\
&= O(n^{-\lambda^{2/3}})\, O(\ln n + n) + O(n^{-\lambda^{2/3}})\, O(n) = O(n^{-\lambda^{2/3}+1})
\end{aligned}
\tag{10}
\]

Remark As we will see in the proof, the bounds of Lemma 2 are uniform in function sets
with bounded $\max f$. This is the case for Sobolev classes $S_\beta(L)$ with $\beta > 1/2$. So over
Sobolev classes, Lemma 2 holds uniformly.

Proof From the Bernstein type inequality for degenerate U-statistics, shown by Arcones,
Giné (1993), it follows that for all $\varphi_{st}$, and analogously for all $\psi_{st}$ with $s < 0$, there exist
constants $c_1$ and $c_2$ independent of $\varphi_{st}$ (and of $\psi_{st}$ respectively), such that:



> 



A In 3 / 2 



n 



n 



< c\ exp < 



Aln 3 / 2 » 



< c\ exp < 



O exp 



C2 ( n _l)Ah^!n 



Aln 3 ^n\ 1/3 



1/2, 
oo | 



1?, II I Z'"- 1 1 II ^ ||2 Aln 3 / 2 n A 

27r ii^n- nVotlb + ^— (i^ll^tlli — ^ — ) 

A 2 / 3 Inn )\ 



1/3 



l + ll/ll^ln-^n 



which is an $O(n^{-\lambda^{2/3}})$, not depending on $s$ and $t$. By analogous calculations, we get for $\psi_{st}$
with $s \ge 0$:



an 0(n~ A2/3 e~ s ), uniform in t. 



A In n + s \ 


= O ^exp | 


> „ 







A 2 / 3 Inn + A _1 / 3 s 



1/2 



□ 
A.2 Wavelet decomposition of the bias. We are now going to apply an additive
decomposition to the bias term in the difference $\overline{CV}(K) - ISE(K)$:

\[ \frac{1}{n} \sum_{j=1}^{n} \bigl( b_K(X_j) + h_f(X_j) \bigr) - E[b_K(X_j) + h_f(X_j)] \]



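where $b_K(x) = f * K(x) - f(x)$, $h_f$ is the high-frequency component of $f$ (definition (5)) and
the Fourier transform of $b_K + h_f$ equals $\tilde b_K \cdot I_n$. In the bias, everything relates to the underlying density, so we construct
basis functions depending on $f$. Let us define the integral of $|\tilde f|^2$ over $[-\omega, \omega]$ as a function

\[ F(\omega) := \int_{-\omega}^{\omega} |\tilde f(\tau)|^2\,d\tau \]

This map transforms the $\omega$-half-axis $[0, \infty)$ by the mapping $\omega \mapsto F(\omega)$ to the interval $[0, \|\tilde f\|_2^2)$.

\[ F_n := F(n) = \int_{-n}^{n} |\tilde f(\tau)|^2\,d\tau, \qquad \lim_{\omega \to \infty} F(\omega) = \|\tilde f\|_2^2 \]

The preimage of an interval, say $[2^{-s}(t-1) F_n, 2^{-s} t F_n)$ with length $2^{-s} F_n$, on this axis is the
interval $[F^{-1}(2^{-s}(t-1) F_n), F^{-1}(2^{-s} t F_n))$ on the original axis. The integral of $|\tilde f|^2$ over the
preimage (taken symmetrically in $\omega$) is obviously $2^{-s} F_n$.

\[ 2^{-s} F_n = \int_{-F^{-1}(2^{-s} t F_n)}^{F^{-1}(2^{-s} t F_n)} |\tilde f(\omega)|^2\,d\omega - \int_{-F^{-1}(2^{-s}(t-1) F_n)}^{F^{-1}(2^{-s}(t-1) F_n)} |\tilde f(\omega)|^2\,d\omega \]

Define the indicator functions:

\[ I'_{st}(\omega) := I\bigl(|\omega| \in [F^{-1}(2^{-s}(t-1) F_n), F^{-1}(2^{-s} t F_n))\bigr) \]

satisfying $\int |\tilde f(\omega)|^2 I'_{st}(\omega)\,d\omega = 2^{-s} F_n$, and the orthonormal wavelet functions:

\[
\begin{aligned}
\tilde\varphi'_{st}(\omega) &:= 2^{s/2} F_n^{-1/2} \tilde f(\omega) I'_{st}(\omega), && \text{for } s = 1, \dots, s_n \text{ with } t = 2^s - 1 \text{ and } s = s_n,\ t = 2^{s_n}, \text{ where } s_n \sim \ln n \\
\tilde\psi'_{st}(\omega) &:= 2^{s/2} F_n^{-1/2} \tilde f(\omega) \bigl[ I'_{s+1,2t-1}(\omega) - I'_{s+1,2t}(\omega) \bigr], && \text{for } s = 1, \dots, s_n - 1 \text{ with } t = 1, \dots, 2^s - 1 \\
& && \text{and } s = s_n, s_n + 1, \dots \text{ with } t = 1, \dots, 2^s
\end{aligned}
\]

$\{\tilde\varphi'_{st} \mid s = 1, \dots, s_n,\ t = 2^s - 1\} \cup \{\tilde\varphi'_{s_n, 2^{s_n}}\} \cup \{\tilde\psi'_{st} \mid s = 1, \dots, s_n - 1,\ t = 1, \dots, 2^s - 1\} \cup \{\tilde\psi'_{st} \mid s \ge
s_n,\ t = 1, \dots, 2^s\}$ represent a complete orthonormal basis for the set of all functions $\{ \tilde f \cdot g \cdot I_n \mid g
\in L_2 \}$, which the bias functions $\tilde b_K \cdot I_n$ belong to for all $K \in \mathcal{K}$. After the inverse Fourier
transform, we have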

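$$b_K(x) + h_f(x) = \sum_{s=1}^{s_n} \alpha'_{s,2^s-1}(b_K)\,\varphi'_{s,2^s-1}(x) + \alpha'_{s_n,2^{s_n}}(b_K)\,\varphi'_{s_n,2^{s_n}}(x) + \sum_{s=1}^{s_n-1}\sum_{t=1}^{2^s-1} \beta'_{st}(b_K)\,\psi'_{st}(x) + \sum_{s=s_n}^{\infty}\sum_{t=1}^{2^s} \beta'_{st}(b_K)\,\psi'_{st}(x),$$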






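which gives in turn

$$\begin{aligned}
\frac{1}{n}\sum_{j=1}^{n}\bigl(b_K(X_j) + h_f(X_j)\bigr) - E\bigl[b_K(X_j) + h_f(X_j)\bigr]
&= \sum_{s=1}^{s_n} \alpha'_{s,2^s-1}(b_K) \times \frac{1}{n}\sum_{j=1}^{n}\bigl[\varphi'_{s,2^s-1}(X_j) - E\varphi'_{s,2^s-1}(X_j)\bigr] \\
&\quad + \alpha'_{s_n,2^{s_n}}(b_K)\,\frac{1}{n}\sum_{j=1}^{n}\bigl[\varphi'_{s_n,2^{s_n}}(X_j) - E\varphi'_{s_n,2^{s_n}}(X_j)\bigr] \\
&\quad + \sum_{s=1}^{s_n-1}\sum_{t=1}^{2^s-1} \beta'_{st}(b_K)\,\frac{1}{n}\sum_{j=1}^{n}\bigl[\psi'_{st}(X_j) - E\psi'_{st}(X_j)\bigr] \\
&\quad + \sum_{s=s_n}^{\infty}\sum_{t=1}^{2^s} \beta'_{st}(b_K)\,\frac{1}{n}\sum_{j=1}^{n}\bigl[\psi'_{st}(X_j) - E\psi'_{st}(X_j)\bigr].
\end{aligned} \tag{11}$$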
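The orthonormality of the coarse part of this basis can be checked in "energy coordinates". The sketch below rests on one observation that follows from the construction: every inner product of two basis functions is $\int |f(\omega)|^2$ times a product of step functions, and each cell of scale $s_n$ carries energy $2^{-s_n}F_n$, so $F_n$ cancels against the two factors $F_n^{-1/2}$. The representation by coefficient vectors on the finest cells and all function names are assumptions made for illustration.

```python
import math

def cell_vector(s, t, level):
    """Indicator of cell t (1-based) on scale s, sampled on the 2**level finest cells."""
    width = 2 ** (level - s)
    return [1.0 if (t - 1) * width <= k < t * width else 0.0
            for k in range(2 ** level)]

def father(s, t, level):
    """Step-function coefficients of the father wavelet with indices (s, t), up to F_n^{-1/2}."""
    scale = 2 ** (s / 2)
    return [scale * v for v in cell_vector(s, t, level)]

def mother(s, t, level):
    """Step-function coefficients of the mother wavelet with indices (s, t), up to F_n^{-1/2}."""
    scale = 2 ** (s / 2)
    left = cell_vector(s + 1, 2 * t - 1, level)
    right = cell_vector(s + 1, 2 * t, level)
    return [scale * (a - b) for a, b in zip(left, right)]

def inner(g, h, level):
    """Inner product: each finest cell has relative energy 2**-level (F_n cancels)."""
    return sum(a * b for a, b in zip(g, h)) / 2 ** level

def coarse_basis(s_n):
    """Father wavelets on all scales up to s_n and mother wavelets on scales below s_n."""
    level = s_n
    basis = [father(s, 2 ** s - 1, level) for s in range(1, s_n + 1)]
    basis.append(father(s_n, 2 ** s_n, level))
    for s in range(1, s_n):
        for t in range(1, 2 ** s):
            basis.append(mother(s, t, level))
    return basis

def max_gram_error(s_n):
    """Largest deviation of the Gram matrix from the identity."""
    basis = coarse_basis(s_n)
    worst = 0.0
    for i, g in enumerate(basis):
        for j, h in enumerate(basis):
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(inner(g, h, s_n) - target))
    return worst

if __name__ == "__main__":
    print(len(coarse_basis(4)), max_gram_error(4))
```

The count $(s_n + 1) + \sum_{s=1}^{s_n-1}(2^s - 1) = 2^{s_n}$ equals the number of cells on scale $s_n$, so these functions span all step functions (times $f$) on that scale; the mother wavelets with $s \ge s_n$ then refine every cell further.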



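Again, we proceed separately, with the aim of finding bounds for the deterministic wavelet
coefficients and for the stochastic processes. Lemma 3 shows that

$$\sum_{s=1}^{s_n} |\alpha'_{s,2^s-1}(b_K)| + |\alpha'_{s_n,2^{s_n}}(b_K)| \le \sqrt{s_n + 1}\,\sqrt{2\pi \int b_K^2(x)\,dx},$$

$$\sum_{t=1}^{2^s-1} |\beta'_{st}(b_K)| \le 2\sqrt{2\pi \int b_K^2(x)\,dx} \quad \text{for } s < s_n,$$

$$\sum_{t=1}^{2^s} |\beta'_{st}(b_K)| \le 2 \cdot 2^{-s/2}\,\|f\|_2 \quad \text{for } s \ge s_n.$$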



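Over a set of "favorable events", whose complement has an asymptotically decreasing probability
(Lemma 4, inequality (12)), the partial sum processes can be controlled. For $A < \infty$, let

$$\begin{aligned}
\mathcal{A}_{n2} := \Bigl\{ (X_1, \dots, X_n) : \
& \Bigl|\frac{1}{n}\sum_{j=1}^{n} \varphi'_{st}(X_j) - E\varphi'_{st}(X_j)\Bigr| \le \frac{A \ln n}{\sqrt{n}}, \quad (s,t) = (1,1), \dots, (s_n, 2^{s_n}-1), (s_n, 2^{s_n}); \\
& \Bigl|\frac{1}{n}\sum_{j=1}^{n} \psi'_{st}(X_j) - E\psi'_{st}(X_j)\Bigr| \le \frac{A \ln n}{\sqrt{n}}, \quad s = 1, \dots, s_n - 1, \ t = 1, \dots, 2^s - 1; \\
& \Bigl|\frac{1}{n}\sum_{j=1}^{n} \psi'_{st}(X_j) - E\psi'_{st}(X_j)\Bigr| \le \frac{A \ln n + s}{\sqrt{n}}, \quad s \ge s_n, \ t = 1, \dots, 2^s \Bigr\}
\end{aligned}$$

and $P(\mathcal{A}^c_{n2}) = O(n^{-A+1})$.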
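Following (11) and taking into account that $2^{-s_n} < n^{-1}$, it holds on $\mathcal{A}_{n2}$:

$$\begin{aligned}
\Bigl|\frac{1}{n}\sum_{j=1}^{n}\bigl(b_K(X_j) + h_f(X_j)\bigr) - E\bigl[b_K(X_j) + h_f(X_j)\bigr]\Bigr|
&\le \frac{A \ln n}{\sqrt{n}} \Bigl[\sum_{s=1}^{s_n} |\alpha'_{s,2^s-1}(b_K)| + |\alpha'_{s_n,2^{s_n}}(b_K)| + \sum_{s=1}^{s_n-1}\sum_{t=1}^{2^s-1} |\beta'_{st}(b_K)|\Bigr] \\
&\quad + \sum_{s=s_n}^{\infty} \frac{A \ln n + s}{\sqrt{n}} \sum_{t=1}^{2^s} |\beta'_{st}(b_K)|
\end{aligned}$$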

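$$\le \frac{A \ln n}{\sqrt{n}} \bigl(\sqrt{s_n + 1} + 2(s_n - 1)\bigr) \sqrt{2\pi \int b_K^2(x)\,dx} + \frac{2\,\|f\|_2}{\sqrt{n}} \sum_{s=s_n}^{\infty} (A \ln n + s)\, 2^{-s/2},$$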
which completes (8). Now we prove the remaining assertions.



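Lemma 3 The coefficients of the bias defined through the $f$-dependent function basis satisfy

$$\sum_{s=1}^{s_n} |\alpha'_{s,2^s-1}(b_K)| + |\alpha'_{s_n,2^{s_n}}(b_K)| \le \sqrt{s_n + 1}\,\sqrt{2\pi \int b_K^2(x)\,dx},$$

$$\sum_{t=1}^{2^s-1} |\beta'_{st}(b_K)| \le 2\sqrt{2\pi \int b_K^2(x)\,dx} \quad \text{for } s < s_n,$$

$$\sum_{t=1}^{2^s} |\beta'_{st}(b_K)| \le 2 \cdot 2^{-s/2}\,\|f\|_2 \quad \text{for } s \ge s_n.$$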



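Proof The father wavelet coefficients are bounded in the same way as in Lemma 1, such
that

$$\sum_{s=1}^{s_n} |\alpha'_{s,2^s-1}(b_K)| + |\alpha'_{s_n,2^{s_n}}(b_K)| \le \sqrt{\int |b_K(\omega)|^2\,d\omega}\;\sqrt{s_n + 1}.$$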

For every $t$ in the summation range, choose an arbitrary $\omega_{st} \in [F^{-1}(2^{-s}(t-1)F_n), \, F^{-1}(2^{-s}tF_n))$.
Again let $TV(K \mid \operatorname{supp} I'_{st})$ be the total variation of $K$ over the support of $I'_{st}$.

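$$\begin{aligned}
\sum_t |\beta'_{st}(b_K)|
&= \sum_t \Bigl|\int b_K(\omega)\,\overline{\psi'_{st}(\omega)}\,d\omega\Bigr| \\
&= \sum_t 2^{s/2} F_n^{-1/2} \Bigl|\int f(\omega)\bigl(1 - K(\omega)\bigr)\overline{f(\omega)}\bigl[I'_{s+1,2t-1}(\omega) - I'_{s+1,2t}(\omega)\bigr]\,d\omega\Bigr| \\
&= \sum_t 2^{s/2} F_n^{-1/2} \Bigl|\int |f(\omega)|^2 \bigl(1 - K(\omega_{st})\bigr)\bigl[I'_{s+1,2t-1}(\omega) - I'_{s+1,2t}(\omega)\bigr]\,d\omega \\
&\qquad\qquad + \int |f(\omega)|^2 \bigl(K(\omega_{st}) - K(\omega)\bigr)\bigl[I'_{s+1,2t-1}(\omega) - I'_{s+1,2t}(\omega)\bigr]\,d\omega\Bigr| \\
&\le \sum_t 2^{s/2} F_n^{-1/2} \Bigl[\bigl(1 - K(\omega_{st})\bigr)\Bigl|\int |f(\omega)|^2 \bigl[I'_{s+1,2t-1}(\omega) - I'_{s+1,2t}(\omega)\bigr]\,d\omega\Bigr| \\
&\qquad\qquad + \int |f(\omega)|^2\, TV\bigl(K \mid \operatorname{supp} I'_{s+1,2t-1}\bigr)\, I'_{s+1,2t-1}(\omega)\,d\omega \\
&\qquad\qquad + \int |f(\omega)|^2\, TV\bigl(K \mid \operatorname{supp} I'_{s+1,2t}\bigr)\, I'_{s+1,2t}(\omega)\,d\omega\Bigr] \\
&\le \sum_t 2^{s/2} F_n^{-1/2} \Bigl[0 + TV\bigl(K \mid \operatorname{supp} I'_{st}\bigr) \int |f(\omega)|^2 I'_{st}(\omega)\,d\omega\Bigr] \\
&= \sum_t 2^{s/2} F_n^{-1/2}\, TV\bigl(K \mid \operatorname{supp} I'_{st}\bigr)\, F_n 2^{-s} \\
&= 2^{-s/2} F_n^{1/2}\, TV\Bigl(K \Bigm| \operatorname{supp} \bigcup_t I'_{st}\Bigr).
\end{aligned}$$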

The mother wavelets on the scales $s \ge s_n$ are defined over the whole interval $[-n, n]$, therefore
$TV(K \mid \operatorname{supp} \bigcup_t I'_{st}) = TV(K) \le 2$. For $s < s_n$, the mother wavelets are supported on
$[-F^{-1}((1 - 2^{-s})F_n), \, F^{-1}((1 - 2^{-s})F_n)]$. On this interval, the total variation amounts to at
most $2[1 - K(F^{-1}((1 - 2^{-s})F_n))]$.

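$$\begin{aligned}
\sum_t |\beta'_{st}(b_K)|
&\le 2^{-s/2} F_n^{1/2}\, 2\bigl[1 - K\bigl(F^{-1}((1 - 2^{-s})F_n)\bigr)\bigr] \\
&= 2^{-s/2} F_n^{1/2} \Bigl(\int |\varphi'_{s,2^s}(\omega)|^2\,d\omega\Bigr)\, 2\bigl[1 - K\bigl(F^{-1}((1 - 2^{-s})F_n)\bigr)\bigr] \\
&= 2^{s/2} F_n^{-1/2} \int |f(\omega)|^2 I'_{s,2^s}(\omega)\,d\omega \; 2\bigl[1 - K\bigl(F^{-1}(2^{-s}(2^s - 1)F_n)\bigr)\bigr] \\
&\le 2^{s/2} F_n^{-1/2}\, 2 \int f(\omega)\bigl[1 - K(\omega)\bigr]\overline{f(\omega)}\, I'_{s,2^s}(\omega)\,d\omega \\
&= 2\,\Bigl|\int b_K(\omega)\,\overline{\varphi'_{s,2^s}(\omega)}\,d\omega\Bigr| \\
&\le 2 \sqrt{\int |b_K(\omega)|^2\,d\omega}\;\sqrt{\int |\varphi'_{s,2^s}(\omega)|^2\,d\omega} \\
&= 2 \sqrt{\int |b_K(\omega)|^2\,d\omega}.
\end{aligned}$$

□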
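The chain of bounds for the mother wavelet coefficients on the scales $s < s_n$ can be illustrated numerically. This is a sketch under explicit assumptions that do not come from the text: $|f(\omega)|^2 = \exp(-\omega^2)$, a Gaussian kernel transform $K(\omega) = \exp(-h^2\omega^2/2)$ with a hypothetical bandwidth $h = 0.5$, and cut-off $n = 4$. With these choices all integrals have closed forms through the error function.

```python
import math

H = 0.5                   # hypothetical bandwidth of the assumed Gaussian kernel
N = 4.0                   # frequency cut-off n (illustrative value)
C = 1.0 + H * H / 2.0     # exponent of |f|^2 * K = exp(-C w^2)

def K(w):
    """Assumed kernel transform, decreasing in |w| with K(0) = 1."""
    return math.exp(-H * H * w * w / 2.0)

def F(w):
    """Integral of |f|^2 = exp(-w^2) over [-w, w]."""
    return math.sqrt(math.pi) * math.erf(w)

def G(w):
    """Integral of |f|^2 * K over [-w, w]."""
    return math.sqrt(math.pi / C) * math.erf(math.sqrt(C) * w)

def F_inv(y):
    """Invert F on [0, N] by bisection."""
    lo, hi = 0.0, N
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bias_energy(a, b):
    """Integral of |f|^2 (1 - K) over the symmetric cell {a <= |w| < b}."""
    return (F(b) - F(a)) - (G(b) - G(a))

def sum_abs_beta(s):
    """Sum over t = 1, ..., 2^s - 1 of the absolute mother wavelet coefficients."""
    F_n = F(N)
    cells = 2 ** (s + 1)
    q = [0.0] + [F_inv(k * F_n / cells) for k in range(1, cells)] + [N]
    scale = 2 ** (s / 2) / math.sqrt(F_n)
    total = 0.0
    for t in range(1, 2 ** s):
        left = bias_energy(q[2 * t - 2], q[2 * t - 1])   # cell 2t-1 on scale s+1
        right = bias_energy(q[2 * t - 1], q[2 * t])      # cell 2t on scale s+1
        total += scale * abs(left - right)
    return total

def proof_bound(s):
    """Intermediate bound 2^{-s/2} F_n^{1/2} * 2 [1 - K(F^{-1}((1 - 2^{-s}) F_n))]."""
    F_n = F(N)
    q = F_inv((1.0 - 2.0 ** (-s)) * F_n)
    return 2 ** (-s / 2) * math.sqrt(F_n) * 2.0 * (1.0 - K(q))

def lemma_bound():
    """2 * sqrt(integral of |b_K(w)|^2 dw) over the real line, in closed form."""
    energy = math.sqrt(math.pi) * (1.0 - 2.0 / math.sqrt(C) + 1.0 / math.sqrt(1.0 + H * H))
    return 2.0 * math.sqrt(energy)

if __name__ == "__main__":
    for s in range(1, 5):
        print(s, sum_abs_beta(s), proof_bound(s), lemma_bound())
```

In this example the coefficient sums stay well below both bounds on every scale tried, which is consistent with the total variation argument: the constant part $1 - K(\omega_{st})$ cancels because the two half-cells carry equal energy, and only the variation of $K$ within a cell contributes.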



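Lemma 4 For any $A < \infty$, the following inequalities hold uniformly for all indicated $s$ and
$t$ and, exactly as in Lemma 2, as well uniformly for $f \in S_\beta(L)$, $\beta > 1/2$:

$$P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^{n} \varphi'_{st}(X_j) - E\varphi'_{st}(X_j)\Bigr| > \frac{A \ln n}{\sqrt{n}}\Bigr) = O(n^{-A}), \quad (s,t) = (1,1), \dots, (s_n, 2^{s_n} - 1) \text{ and } (s_n, 2^{s_n}),$$

$$P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^{n} \psi'_{st}(X_j) - E\psi'_{st}(X_j)\Bigr| > \frac{A \ln n}{\sqrt{n}}\Bigr) = O(n^{-A}), \quad s = 1, \dots, s_n - 1 \text{ and } t = 1, \dots, 2^s - 1,$$

$$P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^{n} \psi'_{st}(X_j) - E\psi'_{st}(X_j)\Bigr| > \frac{A \ln n + s}{\sqrt{n}}\Bigr) = O(n^{-A} e^{-s}), \quad s = s_n, s_n + 1, \dots \text{ and } t = 1, \dots, 2^s.$$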

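$\mathcal{A}^c_{n2}$ is the union of all complementary sets and the approximations yield

$$\begin{aligned}
P(\mathcal{A}^c_{n2}) &= O(n^{-A}) \Bigl[\sum_{s=1}^{s_n} 1 + 1 + \sum_{s=1}^{s_n-1}\sum_{t=1}^{2^s-1} 1\Bigr] + \sum_{s=s_n}^{\infty}\sum_{t=1}^{2^s} O(n^{-A} e^{-s}) \\
&= O(n^{-A})\,O(\ln n + n) + O(n^{-A}),
\end{aligned} \tag{12}$$

so $P(\mathcal{A}^c_{n2})$ is less than an $O(n^{-A+1})$.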
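The counting behind (12) can be made concrete. The sketch below assumes that $s_n$ is the smallest integer with $2^{-s_n} < n^{-1}$, which is one admissible reading of $s_n \sim \ln n$ and not stated in this form in the text; the function name is hypothetical.

```python
import math

def union_bound_terms(n):
    """Count the events entering the union bound for an integer sample size n >= 1.

    Assumption: s_n is the smallest integer with 2**s_n > n.
    """
    s_n = n.bit_length()                                # floor(log2 n) + 1, so 2**s_n > n
    fathers = s_n + 1                                   # (s, 2^s - 1), s = 1..s_n, plus (s_n, 2^{s_n})
    mothers = sum(2 ** s - 1 for s in range(1, s_n))    # scales s < s_n
    # Geometric tail: sum over s >= s_n of 2^s * e^{-s}, the weight of the fine scales.
    ratio = 2.0 / math.e
    tail = ratio ** s_n / (1.0 - ratio)
    return s_n, fathers, mothers, tail

if __name__ == "__main__":
    for n in (100, 1000, 10 ** 6):
        print(n, union_bound_terms(n))
```

Under this choice of $s_n$ the events with threshold $A \ln n / \sqrt{n}$ number exactly $2^{s_n} \le 2n$, and the fine scales contribute a bounded geometric sum, so the total is $O(n) \cdot O(n^{-A})$ as claimed.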



Proof According to Bernstein's inequality (e.g. Shorack, Wellner (1986), p. 855), for
all $\varphi'_{st}$ and analogously for all $\psi'_{st}$ with $s < s_n$ it holds that:



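$$\begin{aligned}
P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^{n} \varphi'_{st}(X_j) - E\varphi'_{st}(X_j)\Bigr| > \frac{A \ln n}{\sqrt{n}}\Bigr)
&\le 2 \exp\Biggl\{-\frac{\frac{n}{2}\bigl(\frac{A \ln n}{\sqrt{n}}\bigr)^2}{\operatorname{Var}\varphi'_{st}(X_1) + \frac{1}{3}\|\varphi'_{st}\|_\infty \frac{A \ln n}{\sqrt{n}}}\Biggr\} \\
&\le 2 \exp\Bigl\{-\frac{A^2 \ln^2 n}{\|f\|_\infty + A \ln n}\Bigr\} \\
&= O\Bigl(\exp\Bigl\{-\frac{A \ln n}{1 + \|f\|_\infty A^{-1} \ln^{-1} n}\Bigr\}\Bigr),
\end{aligned}$$

which is a uniform $O(n^{-A})$. For $\psi'_{st}$ with $s \ge s_n$: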



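$$P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^{n} \psi'_{st}(X_j) - E\psi'_{st}(X_j)\Bigr| > \frac{A \ln n + s}{\sqrt{n}}\Bigr) = O\Bigl(\exp\Bigl\{-\frac{A \ln n + s}{1 + \|f\|_\infty A^{-1} \ln^{-1} n}\Bigr\}\Bigr),$$

a uniform $O(n^{-A} e^{-s})$.

□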
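The step from the exponent $A \ln n / (1 + \|f\|_\infty A^{-1} \ln^{-1} n)$ to the rate $O(n^{-A})$ can be checked with a short computation. Since $1/(1+x) \ge 1 - x$ for $x \ge 0$, the exponent is at least $A \ln n - \|f\|_\infty$, so the exponential is at most $e^{\|f\|_\infty} n^{-A}$. The sketch below evaluates this ratio; the value $0.4$ for $\|f\|_\infty$ (roughly the supremum of the standard normal density) and the function names are assumptions for illustration.

```python
import math

def bernstein_exponent(n, A, f_sup):
    """Exponent A ln n / (1 + f_sup / (A ln n)) from the last line of the bound."""
    ln_n = math.log(n)
    return A * ln_n / (1.0 + f_sup / (A * ln_n))

def ratio_to_power_rate(n, A, f_sup):
    """exp(-exponent) divided by n^{-A}; the argument above bounds it by exp(f_sup)."""
    return math.exp(A * math.log(n) - bernstein_exponent(n, A, f_sup))

if __name__ == "__main__":
    f_sup = 0.4  # assumed sup norm of the density
    for n in (10, 1000, 10 ** 6):
        for A in (1.0, 2.0, 5.0):
            print(n, A, ratio_to_power_rate(n, A, f_sup), math.exp(f_sup))
```

The ratio stays between $1$ and $e^{\|f\|_\infty}$ for every $n > 1$ and $A > 0$, which is why the constant in the $O(n^{-A})$ term depends on $f$ only through $\|f\|_\infty$.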